[Feature] implement async LoRA prefetch #14190
Closed
glenliu21 wants to merge 5 commits into sgl-project:main
Motivation
This PR addresses #8712. I used the prefetch policy described in S-LoRA, where LoRA adapters are prefetched based on which requests are in the Scheduler's waiting queue.
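The policy above can be illustrated with a minimal sketch. It is not the code from this PR: the request shape (a dict with a `lora_name` key), the `Prefetcher` class, and `load_adapter` are hypothetical names chosen for illustration. The sketch only shows the idea of scanning the waiting queue in order, picking up to `max_loras_prefetch` adapters that are not yet loaded, and loading them on a background thread. The PR additionally uses a separate `torch.cuda.Stream` for the copies, which this sketch leaves out so that it runs without a GPU.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def pick_adapters_to_prefetch(waiting_queue, already_loaded, max_loras_prefetch):
    """Return adapter names to prefetch, in waiting-queue order."""
    picked = []
    for req in waiting_queue:
        if len(picked) >= max_loras_prefetch:
            break
        name = req.get("lora_name")
        # Skip base-model requests and adapters already loaded or picked.
        if name is None or name in already_loaded or name in picked:
            continue
        picked.append(name)
    return picked


def load_adapter(name):
    # Hypothetical stand-in for the real host-to-device weight copy.
    return f"{name}:loaded"


class Prefetcher:
    """Hypothetical helper that loads adapters on a background thread."""

    def __init__(self, max_loras_prefetch):
        self.max_loras_prefetch = max_loras_prefetch
        self.resident = set()   # adapters whose load has completed
        self.in_flight = {}     # adapter name -> Future
        self.executor = ThreadPoolExecutor(max_workers=1)

    def schedule(self, waiting_queue):
        # Do not resubmit adapters that are loaded or currently loading.
        busy = self.resident | set(self.in_flight)
        names = pick_adapters_to_prefetch(
            waiting_queue, busy, self.max_loras_prefetch
        )
        for name in names:
            self.in_flight[name] = self.executor.submit(load_adapter, name)

    def collect(self):
        """Wait for in-flight loads and mark them resident."""
        for name, future in list(self.in_flight.items()):
            future.result()
            self.resident.add(name)
            del self.in_flight[name]


queue = deque([
    {"lora_name": "adapter1"},
    {"lora_name": None},        # base-model request, nothing to prefetch
    {"lora_name": "adapter2"},
    {"lora_name": "adapter1"},  # duplicate, already picked
    {"lora_name": "adapter3"},  # beyond the prefetch budget of 2
])
prefetcher = Prefetcher(max_loras_prefetch=2)
prefetcher.schedule(queue)
prefetcher.collect()
print(sorted(prefetcher.resident))  # ['adapter1', 'adapter2']
```

In a real scheduler, `collect` would more likely poll for finished loads between batches than block on them; blocking here just keeps the example deterministic.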
Modifications
- max_loras_prefetch as a server argument
- ForwardBatch as a LoRA prefetch batch, which consists of the requests that are next to be run on the Scheduler's waiting queue
- ThreadPoolExecutor and a separate torch.cuda.Stream to enable async prefetching
Accuracy Tests
Benchmarking and Profiling
@ConnorLi96 ran the following commands to benchmark LoRA prefetching:
for i in {1..16}; do curl -s -X POST http://0.0.0.0:30001/load_lora_adapter -H 'Content-Type: application/json' -d "{\"lora_name\": \"adapter${i}\", \"lora_path\": \"/workspace/adapters/llama_3_1_8B_adapter\"}"; echo " ✓ adapter${i}"; done

python3 -m sglang.bench_serving --backend sglang --base-url http://localhost:30001/ --dataset-name random --num-prompts 100 --request-rate 4 --random-input-len 2048 --random-output-len 1024 --disable-ignore-eos --disable-tqdm --lora-name adapter1 adapter2 adapter3 adapter4 adapter5 adapter6 adapter7 adapter8 adapter9 adapter10 adapter11 adapter12 adapter13 adapter14 adapter15 adapter16

This yielded the following results:
Before
After
These results show about a 31% decrease in TTFT and a 27% decrease in E2E latency.
Checklist